This paper puts dimension reduction into the historical
context of sufficiency, efficiency and principal
component analysis, and opens up an avenue toward 
efficient dimension reduction via maximum likeli- 
hood estimation of inverse regression. I congratulate 
Professor Cook for this insightful and groundbreak- 
ing work. My discussion will focus on two points 
that explore and extend Cook's ideas. The first is 
about the relationship between the principal com- 
ponent analysis of the predictor and the regression 
of the response on the predictor; the second explores 
various ways of extending Cook's inverse regression 
to characterize and estimate variance components. 

1. PCA OF X AND REGRESSION OF Y 

In his paper Professor Cook has told an intrigu- 
ing and fascinating history of the opposing views re- 
garding the relationship between the principal com- 
ponent analysis of X and the regression of Y on X. 
On the one hand, it is often the case in practice that 
the first few principal components of X tend to have 
higher correlations with Y than the other principal 
components of X, but on the other hand there seems 
no logical reason to believe that the direction along 
which X varies the most should somehow have a re- 
lation with Y. In this section I ask, and attempt to 
answer, the following question: is it possible for the 
first principal component of X to have higher corre- 
lation with Y (than the other principal components 
of X) even if nature is "neutral" in assigning a rela- 
tion between X and Y and "arbitrary" in assigning 
a covariance matrix to X? 

To pursue this curiosity let us consider the following situation.
Let $\mathbb{R}_+^{p\times p}$ be the collection of all p by p positive
definite matrices, and let F be a distribution over
$\mathbb{R}_+^{p\times p}$ that is in some sense uniform. Suppose nature
randomly selects a covariance matrix $\Sigma$ according to F, and
generates X from $N(0, \Sigma)$. Furthermore, suppose that nature
selects a linear relation between X and Y completely independently of
the way it selected $\Sigma$; that is, $Y = \beta^T X + \varepsilon$,
where $\beta$ is a random vector in $\mathbb{R}^p$,
$\beta \perp\!\!\!\perp (\Sigma, X)$, and
$\varepsilon \perp\!\!\!\perp (X, \beta, \Sigma)$ (here
$\perp\!\!\!\perp$ indicates independence). Let $v_1, \ldots, v_p$ be
the eigenvectors of the random matrix $\Sigma$, arranged so that their
eigenvalues satisfy $\lambda(v_1) > \cdots > \lambda(v_p)$. Let
$\rho_i(\beta, \Sigma)$ be the correlation coefficient between
$v_i^T X$ and Y, conditioning on $\beta$ and $\Sigma$. Thus
$\rho_1(\beta, \Sigma), \ldots, \rho_p(\beta, \Sigma)$ are random
variables depending on $\beta$ and $\Sigma$. The question is: does
$|\rho_1(\beta, \Sigma)|$ in any sense tend to be larger than
$|\rho_2(\beta, \Sigma)|, \ldots, |\rho_p(\beta, \Sigma)|$?

To make the situation as simple as possible we take p = 2. We consider
two ways of generating $\Sigma$ "uniformly" over
$\mathbb{R}_+^{2\times 2}$. Let $\lambda_1, \lambda_2$ be i.i.d.
$U(0, c)$, where c is a large number, say c = 1000. Let A be a random
rotation matrix, say

\[
A = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix},
\]

where $\theta \sim U(0, 2\pi)$ and
$\theta \perp\!\!\!\perp (\lambda_1, \lambda_2)$. Let

\[
\Sigma = A\,[\mathrm{diag}(\lambda_1, \lambda_2)]\,A^T.
\]

Intuitively, we first create a horizontal (or vertical) ellipse with
arbitrary lengths of axes and then rotate it to an arbitrary angle
$\theta$. Since c is large this provides a reasonable approximation to
a uniformly distributed $\Sigma$ over $\mathbb{R}_+^{2\times 2}$. Let
X, $\beta$ and Y be generated according to the procedure described in
the last paragraph, with $\beta \sim N(0, I_p)$. For simplicity, we
take $\varepsilon = 0$ because it has no bearing on the problem. We
compute the probability

\[
P\{\rho_1(\beta, \Sigma) > \rho_2(\beta, \Sigma)\} \tag{1}
\]

by simulation, as follows. First, generate an i.i.d. sample
$(\Sigma_1, \beta_1), \ldots, (\Sigma_n, \beta_n)$. For each
$(\beta_i, \Sigma_i)$, generate an i.i.d. sample
$(X_{i1}, Y_{i1}), \ldots, (X_{im}, Y_{im})$. Using this sample we
estimate $\rho_1(\beta_i, \Sigma_i)$ and $\rho_2(\beta_i, \Sigma_i)$ by
the method of moments. Denote these estimates by
$\hat\rho_{i1}, \hat\rho_{i2}$. Finally, we use the relative frequency
of the cases $\hat\rho_{i1} > \hat\rho_{i2}$ among the sample
$(\Sigma_1, \beta_1), \ldots, (\Sigma_n, \beta_n)$ to estimate the
probability (1). Taking m = n = 200, this probability is estimated to
be 0.65, larger than one half.
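
As a rough check on the simulated value, the probability can be worked
out in closed form for this first scheme. This is a sketch that is not
part of the original argument, and it assumes that (1) is read as
comparing absolute correlations, that $\varepsilon = 0$, and that
population rather than estimated correlations are used. Write
$\lambda_{(1)} > \lambda_{(2)}$ for the ordered eigenvalues and
$Z_i = v_i^T\beta$, which are i.i.d. $N(0,1)$ given $\Sigma$ when
$\beta \sim N(0, I_2)$:

\[
\rho_i = \frac{\sqrt{\lambda_{(i)}}\, Z_i}{\sqrt{\beta^T \Sigma \beta}},
\qquad
P\bigl(|\rho_1| > |\rho_2| \,\big|\, \Sigma\bigr)
= P\Bigl(\bigl|Z_2 / Z_1\bigr| < \sqrt{\lambda_{(1)}/\lambda_{(2)}}\Bigr)
= \frac{2}{\pi}\arctan\sqrt{\lambda_{(1)}/\lambda_{(2)}} .
\]

Since $\lambda_{(2)}/\lambda_{(1)} \sim U(0,1)$ when
$\lambda_1, \lambda_2$ are i.i.d. $U(0, c)$,

\[
P\bigl(|\rho_1| > |\rho_2|\bigr)
= \frac{2}{\pi}\int_0^1 \arctan\bigl(u^{-1/2}\bigr)\, du
= \frac{2}{\pi}\Bigl[\frac{\pi}{2} - \Bigl(\frac{\pi}{2} - 1\Bigr)\Bigr]
= \frac{2}{\pi} \approx 0.637 .
\]

Under these assumptions the value does not depend on c, and it is in
line with the simulated 0.65.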

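An alternative way of generating uniform $\Sigma$ is as follows.
Generate $(\lambda_1, \lambda_2)$ as before. Then, generate $a$ from
$U(-\sqrt{\lambda_1\lambda_2}, \sqrt{\lambda_1\lambda_2})$, and define

\[
\Sigma = \begin{pmatrix} \lambda_1 & a \\ a & \lambda_2 \end{pmatrix}.
\]

Under this alternative scheme we recalculated the probability (1) to
be 0.735, again larger than one half.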
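
The two sampling schemes and the relative-frequency estimate described
above can be sketched in a few lines of Python. This is a minimal
illustration, not the code used for the reported figures: the function
names are hypothetical, NumPy is an assumed dependency, and the
comparison is made on absolute sample correlations, which is an
assumption about how (1) is meant given that the question is posed in
terms of $|\rho_1|$.

```python
import numpy as np

def sigma_rotation(rng, c=1000.0):
    """Scheme 1: rotate diag(lambda_1, lambda_2) by a uniform angle."""
    lam = rng.uniform(0.0, c, size=2)
    theta = rng.uniform(0.0, 2.0 * np.pi)
    rot = np.array([[np.cos(theta), np.sin(theta)],
                    [-np.sin(theta), np.cos(theta)]])
    return rot @ np.diag(lam) @ rot.T

def sigma_offdiag(rng, c=1000.0):
    """Scheme 2: uniform off-diagonal entry a with |a| < sqrt(lambda_1 lambda_2)."""
    lam = rng.uniform(0.0, c, size=2)
    bound = np.sqrt(lam[0] * lam[1])
    a = rng.uniform(-bound, bound)
    return np.array([[lam[0], a], [a, lam[1]]])

def estimate_probability(draw_sigma, n=200, m=200, seed=0):
    """Relative frequency of |rho_1 hat| > |rho_2 hat| over n draws of (Sigma, beta)."""
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(n):
        sigma = draw_sigma(rng)
        beta = rng.standard_normal(2)                 # beta ~ N(0, I_2)
        eigvals, eigvecs = np.linalg.eigh(sigma)      # eigenvalues in ascending order
        scale = np.sqrt(np.clip(eigvals, 0.0, None))
        # X = Z diag(scale) V^T has covariance V diag(eigvals) V^T = Sigma
        x = (rng.standard_normal((m, 2)) * scale) @ eigvecs.T
        y = x @ beta                                  # epsilon = 0
        pcs = x @ eigvecs                             # column 1 is the first PC
        r1 = np.corrcoef(pcs[:, 1], y)[0, 1]          # sample correlation, first PC
        r2 = np.corrcoef(pcs[:, 0], y)[0, 1]          # sample correlation, second PC
        wins += int(abs(r1) > abs(r2))
    return wins / n

if __name__ == "__main__":
    print(estimate_probability(sigma_rotation))
    print(estimate_probability(sigma_offdiag))
```

With m = n = 200 the estimates carry a standard error of roughly 0.03,
so individual runs will scatter around the values quoted in the text.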

I have tried several distributions for $\beta$ and values of c, and
the probability (1) is invariably greater than one half. Thus it seems
reasonable to make the following conjecture [we will abbreviate the
random variable $\rho_i(\beta, \Sigma)$ by $\rho_i$].

Conjecture 1.1. Suppose $\Sigma$ is a random matrix uniformly
distributed over $\mathbb{R}_+^{p\times p}$, and suppose
$X \sim N(0, \Sigma)$ and $Y = \beta^T X + \varepsilon$ with
$\beta \perp\!\!\!\perp (X, \Sigma)$,
$\varepsilon \perp\!\!\!\perp (X, \beta, \Sigma)$ and
$\varepsilon \sim N(0, \sigma^2)$. Then, for any
$i \in \{2, \ldots, p\}$,

\[
P\bigl(|\rho_1| = \max\{|\rho_1|, \ldots, |\rho_p|\}\bigr)
> P\bigl(|\rho_i| = \max\{|\rho_1|, \ldots, |\rho_p|\}\bigr). \tag{2}
\]

This conjecture, if true, does seem to suggest that, 
if nature selects an arbitrary covariance matrix for 
X and an arbitrary linear relation between X and 
Y, then the first principal component of X tends 
to have the largest correlation with Y among all 
principal components of X. 

To see why this conjecture should hold, imagine 
the extreme case where support of X is concentrated 
on a line. In this case the only way for Y to be corre- 
lated with X is to be correlated with its first princi- 
pal component. Intuitively, this tendency should still 
hold when the distribution of X is not concentrated 
on a line but has elongated elliptical contours. Now, 
if nature draws $\Sigma$ from a uniform distribution, there
is a nonzero probability that the distribution of X
has elongated contours, in which case the projection
of X onto the longest axis of the ellipsoid tends to
have the largest correlation with Y (among its projections
onto other axes), even if $\beta$ is drawn independently
of $\Sigma$. In the cases where X does not have
elongated contours, $|\rho_1|$ would not stand out as the
largest, but then neither would the other $\rho_i$'s. Thus,
on average, something like (2) should hold.

The above example also shows that the tendency (2) is a modest one.
When p = 2 the probability (1) is around 65% to 75%, only modestly
larger than 50%. Similarly, when p is larger than 2 I do not expect
this probability to be drastically larger than 1/p [which is the
probability in (1) when $\rho_1, \ldots, \rho_p$ are symmetric]. Thus
there should still be a substantial gain in performing dimension
reduction of X in reference to Y.
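
For p larger than 2 the size of the tendency can be explored
numerically. The sketch below is only an illustration under stated
assumptions and does not settle the conjecture: it assumes eigenvalues
that are i.i.d. $U(0, c)$, $\beta \sim N(0, I_p)$ independent of
$\Sigma$, and population correlations, in which case $|\rho_i|$ is
proportional to $\sqrt{\lambda(v_i)}\,|v_i^T\beta|$ and the
$v_i^T\beta$ are i.i.d. $N(0,1)$ whatever the eigenvectors are. The
function name is hypothetical and NumPy is an assumed dependency.

```python
import numpy as np

def argmax_rank_frequencies(p, n=100_000, c=1000.0, seed=0):
    """Frequency with which each principal component (ordered by decreasing
    eigenvalue) has the largest absolute population correlation with Y."""
    rng = np.random.default_rng(seed)
    # Eigenvalues sorted so that column 0 belongs to the first PC.
    lam = np.sort(rng.uniform(0.0, c, size=(n, p)), axis=1)[:, ::-1]
    z = rng.standard_normal((n, p))          # z_i plays the role of v_i^T beta
    winners = np.argmax(np.sqrt(lam) * np.abs(z), axis=1)
    return np.bincount(winners, minlength=p) / n

if __name__ == "__main__":
    for p in (2, 3, 5, 10):
        freq = argmax_rank_frequencies(p)
        print(p, freq[0], 1.0 / p)           # first-PC frequency next to 1/p
```

Comparing the first entry of the returned vector with 1/p gives a
direct view of how modest or pronounced the effect is under this
particular choice of "uniform" distribution.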

2. INVERSE REGRESSION FOR PRINCIPAL 
VARIANCE COMPONENT 

What is interesting about Cook's inverse regression model [model (2)
in Cook's paper] is that the parameter $\Gamma$ automatically provides
sufficient dimension reduction for the forward model, in the sense
that $Y \perp\!\!\!\perp X \mid \Gamma^T X$. The same idea can be used
to construct an inverse regression model where the conditional
variance $\mathrm{var}(X \mid y)$, rather than the conditional mean
$E(X \mid y)$, depends on y. Such models would be useful in
classification problems where the several groups involved differ in
their dispersions but not so much in their locations. See, for
example, Cook and Yin (2001) for a breast cancer data set whose
behavior roughly fits this description.

Consider the inverse regression model

\[
X = \sigma^2(\Gamma \nu_y \Gamma^T + I_p)\varepsilon, \tag{3}
\]

where $\nu_{(\cdot)} : \Omega_Y \to \mathbb{R}^{d\times d}$, $d < p$ and
$\Gamma$ is a $p \times d$ semiorthogonal matrix.


Theorem 2.1. If $Y \perp\!\!\!\perp \varepsilon$ and if model (3)
holds, then $Y \perp\!\!\!\perp X \mid \Gamma^T X$.

Proof. Let $\Gamma_0$ be a $p \times (p - d)$ semiorthogonal matrix
such that $\Gamma_0^T \Gamma = 0$. Relation (3) implies the equalities

\[
\begin{aligned}
\Gamma^T X &= \sigma^2(\nu_y + I_d)\Gamma^T \varepsilon, \\
\Gamma_0^T X &= \sigma^2 \Gamma_0^T \varepsilon.
\end{aligned} \tag{4}
\]

By the assumption $\varepsilon \perp\!\!\!\perp Y$, conditioning on Y,
$\Gamma_0^T X$ and $\Gamma^T X$ are multivariate normal with
conditional covariance

\[
\mathrm{cov}(\Gamma^T X, \Gamma_0^T X \mid Y = y)
= \sigma^4(\nu_y + I_d)\Gamma^T \Gamma_0 = 0.
\]

Hence $\Gamma^T X \perp\!\!\!\perp \Gamma_0^T X \mid Y$. Meanwhile, from
the second equality in (4) we see that
$\Gamma_0^T X \perp\!\!\!\perp Y$. Hence
$Y \perp\!\!\!\perp X \mid \Gamma^T X$. □

Fig. 1. Dimension reduction via model (3). Left panel: $X_1$ versus $Y$; right panel: $\Gamma^T X$ versus $Y$.

To see how this model can be used in practice, we 
consider the following example as an illustration. 

Example 2.1. We take p = 3 and d = 1, $\nu_y = |y|$ and
$\Gamma^T = (1, 0, 0)$. Assume $Y \sim N(2, 1)$,
$Y \perp\!\!\!\perp \varepsilon$ and $\varepsilon \sim N(0, I_p)$.
Thus we have the inverse regression model

\[
X = \begin{pmatrix} 1 + |y| & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \varepsilon.
\]

We generate $(X_1, Y_1), \ldots, (X_n, Y_n)$, where n = 100, from
model (3), and estimate $\Gamma$ by a numerical maximization of the
likelihood, which gives

\[
\hat\Gamma^T = (0.964, 0.047, 0.068).
\]

Figure 1 presents the scatterplots of $X_1$ versus $Y$ (left panel)
and $\Gamma^T X$ versus $Y$ (right panel). We can see that they are
very much in agreement.
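
The setting of Example 2.1 can be reproduced approximately with the
following Python sketch. It is an illustration under stated
assumptions, not the procedure behind the reported estimate: it takes
$\sigma^2 = 1$ and $\nu_y = |y|$ as known and constrains the estimate
to unit length. Under those assumptions the conditional covariance of
X given Y = y is $(I_p + |y|\gamma\gamma^T)^2$, and the likelihood is
maximized by the leading eigenvector of $\sum_i w_i X_i X_i^T$ with
$w_i = 1 - (1 + |Y_i|)^{-2}$, so an eigendecomposition stands in for a
general-purpose numerical optimizer. The function names are
hypothetical and NumPy is an assumed dependency.

```python
import numpy as np

def simulate_example_2_1(n=100, seed=0):
    """Draw (X_i, Y_i) from X = diag(1 + |Y|, 1, 1) eps with Y ~ N(2, 1)."""
    rng = np.random.default_rng(seed)
    y = rng.normal(2.0, 1.0, size=n)
    x = rng.standard_normal((n, 3))          # eps ~ N(0, I_3), independent of Y
    x[:, 0] *= 1.0 + np.abs(y)
    return x, y

def neg_loglik(gamma, x, y):
    """Negative log-likelihood of X given Y, up to an additive constant,
    for X | Y=y ~ N(0, (I + |y| gamma gamma^T)^2) with gamma scaled to unit length."""
    gamma = gamma / np.linalg.norm(gamma)
    w = 1.0 - 1.0 / (1.0 + np.abs(y)) ** 2
    proj = x @ gamma
    quad = np.sum(x ** 2, axis=1) - w * proj ** 2   # x^T C_y^{-1} x
    return np.sum(np.log1p(np.abs(y)) + 0.5 * quad)

def fit_gamma(x, y):
    """Unit-length maximizer: leading eigenvector of sum_i w_i x_i x_i^T."""
    w = 1.0 - 1.0 / (1.0 + np.abs(y)) ** 2
    moment = (x * w[:, None]).T @ x
    _, eigvecs = np.linalg.eigh(moment)      # eigenvalues in ascending order
    gamma = eigvecs[:, -1]
    # Fix the sign so that the largest component is positive.
    return gamma if gamma[np.argmax(np.abs(gamma))] > 0 else -gamma

if __name__ == "__main__":
    x, y = simulate_example_2_1()
    print(fit_gamma(x, y))
```

Because the estimate here is forced to have unit length and uses a
different random sample, its components will not match the values
quoted above digit for digit; the point is only that the first
component dominates.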

We can further generalize model (3) to accommodate the situations
where both the location and the dispersion in the inverse regression
model depend on y, by combining model (3) above and model (2) in
Cook's paper, as follows:

\[
X = \mu + \Gamma_1 \nu_y + \sigma^2(\Gamma_2 \tau_y \Gamma_2^T + I_p)\varepsilon, \tag{5}
\]

where $\varepsilon \sim N(0, I_p)$, $\varepsilon \perp\!\!\!\perp Y$,
$\Gamma_1 \in \mathbb{R}^{p\times d_1}$ and
$\Gamma_2 \in \mathbb{R}^{p\times d_2}$ with $d_1 + d_2 < p$,
$\nu_{(\cdot)} : \Omega_Y \to \mathbb{R}^{d_1}$ and
$\tau_{(\cdot)} : \Omega_Y \to \mathbb{R}^{d_2\times d_2}$. Here, for
convenience we again assume that $\Gamma_1$ and $\Gamma_2$ are
semiorthogonal matrices. Note that the column spaces of $\Gamma_1$ and
$\Gamma_2$ may or may not be the same. Similarly to model (2) in
Cook's paper and model (3) above, relation (5) provides automatically
a sufficient dimension reduction of X.

Theorem 2.2. If model (5) holds, then
$Y \perp\!\!\!\perp X \mid (\Gamma_1^T X, \Gamma_2^T X)$.

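Proof. Let $\Gamma = (\Gamma_1, \Gamma_2)$, and let $\Gamma_0$ be a
matrix such that the matrix $(\Gamma, \Gamma_0)$ has full row rank and
$\Gamma_0^T \Gamma = 0$. Multiply both sides of equality (5) on the
left by $\Gamma^T$ and $\Gamma_0^T$, respectively, to obtain

\[
\begin{aligned}
\Gamma^T X &= \Gamma^T \mu + \Gamma^T \Gamma_1 \nu_y
  + \sigma^2 \Gamma^T(\Gamma_2 \tau_y \Gamma_2^T + I_p)\varepsilon, \\
\Gamma_0^T X &= \Gamma_0^T \mu + \sigma^2 \Gamma_0^T \varepsilon.
\end{aligned}
\]

Following the same argument as in the proof of Theorem 2.1, we see
that $\Gamma^T X \perp\!\!\!\perp \Gamma_0^T X \mid Y$ and
$\Gamma_0^T X \perp\!\!\!\perp Y$, from which it follows that
$X \perp\!\!\!\perp Y \mid \Gamma^T X$. □

The next example illustrates the use of model (5), which has both a
location and a dispersion component in the inverse regression.

Example 2.2. We take p = 3 and d = 1. Assume $Y \sim N(3, 1)$,
$Y \perp\!\!\!\perp \varepsilon$ and $\varepsilon \sim N(0, I_p)$.
Consider the inverse regression model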



X 



This is a special case of model (5) with
$\Gamma_1 = \Gamma_2 = \Gamma$. As in Example 2.1, we generate n = 100
pairs of observations from this model and maximize the likelihood
numerically, which gives

\[
\hat\Gamma^T = (1.969, 0.052, 0.010).
\]

We see that this estimate is more accurate than that in Example 2.1
(the contrast between the first component and the last two components
is greater). This is because it uses the additional information
provided by the location term. The comparison of $X_1$ versus $Y$ and
$\Gamma^T X$ versus $Y$ is given in Figure 2.

Fig. 2. Dimension reduction via model (5). Left panel: $X_1$ versus $Y$; right panel: $\Gamma^T X$ versus $Y$.

REFERENCE

Cook, R. D. and Yin, X. (2001). Dimension reduction and visualization in discriminant analysis (with discussion).



